Abstract

For survival data with high-dimensional covariates, results generated in the analysis 
of a single dataset are often unsatisfactory because of the small sample size. Integrative 
analysis pools raw data from multiple independent studies with comparable designs, 
effectively increases sample size, and has better performance than meta-analysis and 
single-dataset analysis. In this study, we conduct integrative analysis of survival data 
under the accelerated failure time (AFT) model. The sparsity structures of multiple 
datasets are described using the homogeneity and heterogeneity models. For variable 
selection under the homogeneity model, we adopt group penalization approaches. For 
variable selection under the heterogeneity model, we use composite penalization and 
sparse group penalization approaches. As a major advancement from the existing 
studies, the asymptotic selection and estimation properties are rigorously established. 
A simulation study is conducted to compare the different penalization methods with one another
and against alternatives. We also analyze four lung cancer prognosis datasets with gene expression
measurements. 

Keywords: Integrative analysis; Homogeneity and heterogeneity models; Penalized selection;
Consistency properties.

1 Introduction 

In survival studies, data with high-dimensional covariates are now commonly encountered. A 
lung cancer prognosis study with gene expression measurements is presented in this article, 
and more are available in the literature. With such “large p, small n” data, results generated
in the analysis of a single dataset are often unsatisfactory because of the small sample size
(Guerra and Goldstein, 2009; Liu et al., 2013; Ma et al., 2011b). For outcomes of common
interest, there are often multiple independent studies with comparable designs. This makes 
it possible to pool multiple datasets, increase sample size, and improve over single-dataset 
analysis. As a family of multi-dataset analysis methods, integrative analysis methods pool 
and analyze raw data from multiple studies and outperform classic meta-analysis methods, 
which analyze multiple datasets separately and then combine summary statistics. 

In this article, we conduct the integrative analysis of multiple independent survival 
datasets under the accelerated failure time (AFT) model. The analysis goal is to identify, out 
of a large number of measured covariates, important markers associated with survival. For 
such a purpose, we adopt penalization, which has been the choice of many high-dimensional 
studies. A large number of penalization methods have been developed for single-dataset 
analysis. However, because of the multi-dataset settings and heterogeneity across datasets,
they are not applicable to integrative analysis. The sparsity structures of multiple datasets
can be described using the homogeneity and heterogeneity models. Different models demand
marker selection with different properties and hence different methods. This makes integrative
analysis even more complicated. Penalization methods for integrative analysis have been
developed (Liu et al., 2013; Ma et al., 2011b), however, in an unsystematic manner.


This study advances from the existing ones in the following aspects. First, it advances 
from single-dataset analysis and meta-analysis by conducting integrative analysis of multiple 
heterogeneous datasets. Second, it conducts a more systematic investigation than the existing
integrative analysis studies such as Liu et al. (2013) and Ma et al. (2011b). More importantly, it
rigorously establishes the selection and estimation properties which have not been previously 
examined. The theoretical development is nontrivial because of data complexity, model 
settings, and penalties. Third, the properties of composite penalization and sparse group 
penalization have not been studied for single-dataset analysis under the AFT model. Thus 
our study can also provide insights for single-dataset penalization methods. Fourth, this 
study also advances from the existing studies by conducting systematic simulations and 
direct comparisons of multiple methods. 

Data and model settings are described in Section 2. Penalized integrative analyses under
the homogeneity and heterogeneity models are investigated in Sections 3 and 4, respectively.
We conduct a numerical study in Section 5. The article concludes with discussions in Section
6. Technical details and additional analysis results are provided in the Appendix.


2 Integrative analysis under AFT model 

Consider the integrative analysis of survival data from M independent studies. In study
$m$ ($= 1, \ldots, M$) with $n_m$ iid subjects, let $T^m = (T^m_1, \ldots, T^m_{n_m})^\top$ be the logarithm of failure
times and $X^m \in \mathbb{R}^{n_m \times p_m}$ be the predictor matrix. Assume the AFT model

$$T^m = X^m\beta^m + \epsilon^m, \qquad (1)$$

where $\beta^m$ is the vector of regression coefficients, and $\epsilon^m$ is the vector of random errors. With proper
normalization, the intercept term has been omitted. Assume that all datasets measure the
same set of covariates. Then $p_1 = \cdots = p_M = p$. When different datasets have mismatched
covariate sets, a rescaling approach (Ma et al., 2011a; Liu et al., 2013) can be adopted. The
proposed approaches are then applicable with minor modifications.

Let $\beta = (\beta^1, \ldots, \beta^M) = (\beta_1, \ldots, \beta_p)^\top$, where $\beta_j = (\beta_j^1, \ldots, \beta_j^M)^\top$ consists of the coefficients
of variable $j$ in all M datasets. Moreover, write $\beta = (\beta_{ij})_{p \times M}$ with its true value $\beta^*$,
where $\beta_{ij} = \beta_i^j$. With the heterogeneity across datasets, $\beta_j^m$ is not necessarily equal to $\beta_j^k$
for $m \ne k$. Under right censoring, one observes $(Y^m, \delta^m, X^m)$ with $Y^m = T^m \wedge C^m$, where
$C^m$ is the vector of log censoring times, and $\delta^m = 1\{T^m \le C^m\}$.

When the distribution of random errors is unknown, there are multiple estimation approaches
(Ying, 1993). We adopt the weighted least squares (LS) approach (Stute, 1993),
which has the lowest computational cost and is desirable with high-dimensional data. Let $\hat F^m$
be the Kaplan-Meier estimator of the distribution function $F^m$ of $T^m$. Let $Y^m_{(1)} \le \cdots \le Y^m_{(n_m)}$
be the order statistics of the $Y^m_i$'s. $\hat F^m$ can be written as $\hat F^m(y) = \sum_{i=1}^{n_m} \omega_i^m 1\{Y^m_{(i)} \le y\}$,
where the $\omega_i^m$'s are expressed as

$$\omega_1^m = \frac{\delta^m_{(1)}}{n_m} \quad \text{and} \quad \omega_i^m = \frac{\delta^m_{(i)}}{n_m - i + 1} \prod_{j=1}^{i-1} \left(\frac{n_m - j}{n_m - j + 1}\right)^{\delta^m_{(j)}}, \quad i = 2, \ldots, n_m.$$

Here $\delta^m_{(1)}, \ldots, \delta^m_{(n_m)}$ are the associated censoring indicators of the ordered $Y^m_i$'s. Denote
$W_m = \mathrm{diag}\{n_m\omega_1^m, \ldots, n_m\omega_{n_m}^m\}$. Then for the M datasets combined, the weighted LS approach
is to minimize

$$\frac{1}{2n} \sum_{m=1}^{M} (Y^m - X^m\beta^m)^\top W_m (Y^m - X^m\beta^m). \qquad (2)$$
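The weighted LS construction above can be sketched numerically as follows. This is a minimal sketch rather than the authors' implementation: it assumes the usual Stute form of the Kaplan-Meier weights and a $1/(2n)$ normalization of the loss, and the function names `stute_weights` and `weighted_ls_loss` are hypothetical.

```python
import numpy as np

def stute_weights(y, delta):
    """Return the sort order and the Kaplan-Meier (Stute) weights of the ordered sample."""
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=float)
    n = len(y)
    # Sort by observed log time; for tied times, events come before censored points.
    order = np.lexsort((-delta, y))
    d = delta[order]
    i = np.arange(1, n + 1)
    # factors[j-1] = ((n - j) / (n - j + 1)) ** delta_(j)
    factors = ((n - i) / (n - i + 1.0)) ** d
    # Product over j < i, with an empty product (equal to 1) for i = 1.
    lead = np.concatenate(([1.0], np.cumprod(factors[:-1])))
    weights = d / (n - i + 1.0) * lead
    return order, weights

def weighted_ls_loss(datasets, betas):
    """datasets: list of (X, y, delta) per study; betas: list of coefficient vectors."""
    n_total = sum(len(y) for _, y, _ in datasets)
    total = 0.0
    for (X, y, delta), beta in zip(datasets, betas):
        order, w = stute_weights(y, delta)
        resid = np.asarray(y, dtype=float)[order] - np.asarray(X, dtype=float)[order] @ beta
        # W_m = diag(n_m * w), so the quadratic form is n_m * sum(w * resid^2).
        total += len(y) * np.sum(w * resid ** 2)
    return total / (2.0 * n_total)
```

Without censoring every weight equals $1/n_m$, so $W_m$ is the identity and the loss reduces to ordinary least squares; censored observations receive weight zero and pass their mass to later events.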


Note that the components of $Y^m$ and $X^m$ need to be sorted. Assume that:

[Condition 1] (a) The $n_m$ components of $\epsilon^m$ are i.i.d. and sub-Gaussian with noise level
$\sigma_m$. That is, for any vector $u$ with $\|u\|_2 = 1$ and any $t > 0$, $P(|u^\top\epsilon^m| > t) \le 2\exp\{-t^2/(2\sigma_m^2)\}$.
(b) $\epsilon^m$ is independent of $W_m$.

The total sample size is $n = \sum_{m=1}^{M} n_m$. The important predictor index sets of M datasets
are respectively labeled as $S_1, \ldots, S_M$. Then $S = \bigcup_{m=1}^{M} S_m$ denotes the important set with
its corresponding variables important in at least one dataset. Let $S^c$ and $|S|$ denote the
complement and cardinality of set $S$, respectively. Let $A = \{(i, j) : \beta^*_{ij} \ne 0\}$ and
$B = \{(i, j) : i \in S, j = 1, \ldots, M\}$. Let $\beta_A$ and $\beta_B$ denote the components of $\beta$ indexed by $A$
and $B$, respectively. For a $p \times 1$ vector $v$ and index set $I \subseteq \{1, \ldots, p\}$, let $v_I$ denote the
components of $v$ indexed by $I$. Moreover, let $X^{m,i}$ denote the transposition of the $i$th row
of $X^m$. Then for any index set $I \subseteq \{1, \ldots, p\}$, $X^m_I = (X^{m,1}_I, \ldots, X^{m,n_m}_I)^\top$.


2.1 Homogeneity and heterogeneity models 

The sparsity structure of $\beta$ can be described using the homogeneity and heterogeneity
models. Under the homogeneity model, the $\beta^m$'s have the same sparsity structure. That is,
$I(\beta_j^m = 0) = I(\beta_j^k = 0)$ for all $(m, k, j)$'s. The intuition is that if the M datasets are “close
enough”, then the same set of markers should be identified in all datasets. Under this model,
we only need to determine whether a covariate is important or not, that is, only one level of
selection is needed. With the (sometimes great) differences across datasets, the homogeneity
model may be too restrictive. As an alternative, the heterogeneity model allows different
datasets to have different sparsity structures. It includes the homogeneity model as a special
case and can be more flexible. Under this model, we need to determine whether a covariate
is associated with any response at all. In addition, for an important covariate, we need to
determine in which datasets it is important. That is, a two-level selection is needed.
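The two sparsity structures can be illustrated on a $p \times M$ coefficient matrix whose row $j$ holds the coefficients of variable $j$ in the M datasets. The sketch below is only an illustration; the helper name `sparsity_summary` and the toy matrices are hypothetical.

```python
import numpy as np

def sparsity_summary(beta):
    """beta: p x M matrix (row j = variable j, column m = dataset m)."""
    support = np.asarray(beta, dtype=float) != 0.0   # second level: which datasets
    important_any = support.any(axis=1)              # first level: important anywhere
    important_all = support.all(axis=1)
    # Homogeneity: a variable is either nonzero in every dataset or in none.
    homogeneous = bool(np.array_equal(important_any, important_all))
    return support, important_any, homogeneous

beta_hom = np.array([[1.0, -2.0, 0.5],
                     [0.0,  0.0, 0.0]])
beta_het = np.array([[1.0,  0.0, 0.5],
                     [0.0,  0.0, 0.0]])
```

Here `beta_hom` satisfies the homogeneity model, while `beta_het` requires the heterogeneity model because its first variable is important in datasets 1 and 3 only.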


3 Integrative analysis under the homogeneity model 

Under this model, one-level selection is needed and can be achieved using group penalization.
In terms of formulation and computation, the development of group penalization
methods in integrative analysis shares some similarity with that in single-dataset analysis
(Bühlmann and van de Geer, 2011). However, with the significantly different data settings
and adoption of the AFT model, the theoretical development has significant differences.


3.1 Group LASSO 

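Consider the group LASSO penalized objective function

$$L(\beta) = \frac{1}{2n} \sum_{m=1}^{M} (Y^m - X^m\beta^m)^\top W_m (Y^m - X^m\beta^m) + \lambda \sum_{j=1}^{p} \|\beta_j\|_2, \qquad (3)$$

where $\lambda$ is the tuning parameter and $\|\beta_j\|_2 = \{(\beta_j^1)^2 + \cdots + (\beta_j^M)^2\}^{1/2}$.

For set $S$, define the estimate $\hat\beta_B = (\hat\beta_S^1, \ldots, \hat\beta_S^M)$ as

$$\hat\beta_B = \arg\min_{\beta_B} \frac{1}{2n} \sum_{m=1}^{M} (Y^m - X_S^m\beta_S^m)^\top W_m (Y^m - X_S^m\beta_S^m) + \lambda \sum_{j \in S} \|\beta_j\|_2. \qquad (4)$$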
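The group penalty treats the M coefficients of one variable as a single block, so a variable is kept or dropped in all datasets at once. A minimal sketch of the penalty and of its proximal operator (block soft-thresholding) follows. The proximal operator is a standard tool for group LASSO in general and is shown only as an illustration, not as the algorithm used in this article; the function names are hypothetical.

```python
import numpy as np

def group_lasso_penalty(beta, lam):
    """beta: p x M matrix; returns lam * sum_j ||beta_j||_2 over the rows."""
    return lam * float(np.sum(np.linalg.norm(np.asarray(beta, dtype=float), axis=1)))

def group_soft_threshold(z, lam):
    """Proximal operator of lam * ||.||_2 applied to the block z of one variable."""
    z = np.asarray(z, dtype=float)
    norm = np.linalg.norm(z)
    if norm <= lam:
        # The whole block is set to zero: the variable is dropped in every dataset.
        return np.zeros_like(z)
    # Otherwise the block is shrunk toward zero but no single entry is zeroed.
    return (1.0 - lam / norm) * z
```

For example, the block `[3, 4]` with `lam = 1` is shrunk to `[2.4, 3.2]`, whereas `[0.3, 0.4]` is set to zero as a whole, which is the one-level selection needed under the homogeneity model.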


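For group LASSO to be able to consistently identify the true sparsity structure, there needs to be
a local solution $\hat\beta^{glasso} = \{\hat\beta_B^{glasso}, \hat\beta_{B^c}^{glasso}\}$ where $\hat\beta_B^{glasso} = \hat\beta_B$ and $\hat\beta_{B^c}^{glasso} = 0$. Define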




Theorem 1 Consider the estimator defined by minimizing (3). Under Condition 1,
1. There exists a local minimizer $\hat\beta_B$ of (4) such that


4 n 


M 


Pr < 11/3? - 0Th < = 1, ■ ■ ■ , M i > 1 - exp 

P.\ 


m=l 


X‘^\S\n^ 

2al,p^n„ 


2. Assume the ir-representable conditions ijjrn < Dm < 1- = {/3g“^°,/3gi“^°} with 


^qlasso A ^qlasso n - 7 7 ■ ■ ■ i* 

/ 3 b = PB: Pb<^ = U is a local mimmizer of 


'B 

M 

X^exp 

m=l 


with probability at least 


X^\S\n^ 


M , 

- 2p ^ exp <^ - 

m=l ^ 


n2A2(l - 


2'n-m-^m^m4“ 33y; 


In single-dataset analysis, Zhao and Yu (2006) and followup studies establish selection consistency
under the ir-representable condition. Under a similar condition for individual datasets,
integrative analysis also has selection consistency.

With the probability bounds in Theorem 1, we can obtain a more straightforward understanding
of the penalized estimators and derive the following result.

Corollary 1 Suppose that for m = 1, ■ ■ ■ , M, p™, , and are bounded away from zero 

and infinity. Assume that n/um = 0(1), \S\ <C n, and logp = 0(?7,“) with a < 1. Un¬ 
der Condition 1 and the ir-representable conditions in Theorem 1, if j^iin ||/3*||2 3> 

/-y _ p ^ 

X 3> n~^, then group LASSO can identify the true sparsity structure and 11/3^ — (3^*\\2 = 

Op(A^])^), m = 1, • • • ,M. 

Remark 1 It is known that in single-dataset analysis the group LASSO is group selection
consistent under some variants of the ir-representable condition. See Huana et al. I 201^) 
and others for reference. Similar conditions are needed in the integrative analysis with group 
LASSO. The conditions in Corollary 1 on and Am are on the design matrixes and 

censoring probabilities. Corollary 1 shows that even when the group LASSO can identify the 


true sparsity structure, X should be much large than n ^ leading to ||/3' 


am* 

'Ps 


I 2 > ^/\S\/n. 


3.2 Concave 2-norm group selection 

Consider penalization built on concave penalties. Notable examples of concave penalty
include SCAD (Fan and Li, 2001) and MCP (Zhang, 2010). For $t \ge 0$, the SCAD penalty
has first order derivative $p'_\lambda(t) = \lambda \left\{ I(t \le \lambda) + \frac{(a\lambda - t)_+}{(a - 1)\lambda} I(t > \lambda) \right\}$, for some $a > 2$. The MCP
has derivative $p'_\lambda(t) = \lambda \left(1 - \frac{t}{a\lambda}\right)_+$, for some $a > 1$. Consider the objective function

$$L(\beta) = \frac{1}{2n} \sum_{m=1}^{M} (Y^m - X^m\beta^m)^\top W_m (Y^m - X^m\beta^m) + \sum_{j=1}^{p} p_\lambda(\|\beta_j\|_2), \qquad (5)$$

where the penalty $p_\lambda(\cdot)$ satisfies:

[Condition 2] $\lambda^{-1}p_\lambda(t)$ is concave in $t \in [0, \infty)$ with a continuous derivative $\lambda^{-1}p'_\lambda(t)$
satisfying $\lambda^{-1}p'_\lambda(0+) \in (0, \infty)$. In addition, $\lambda^{-1}p'_\lambda(t)$ is increasing in $\lambda \in (0, +\infty)$, and
$\lambda^{-1}p'_\lambda(0+)$ is independent of $\lambda$.

[Condition 3] $\theta = \inf\{t/\lambda : \lambda^{-1}p'_\lambda(t) = 0, \ t \ge 0\}$ is bounded.


Remark 2 Condition 2 is also considered by Fan and Lv (2011). LASSO, SCAD, and MCP
all satisfy this condition. Condition 3 is added to guarantee unbiasedness. LASSO does not
satisfy Condition 3 since $\lambda^{-1}p'_\lambda(t) = 1$ leads to $\theta = \infty$, while SCAD and MCP satisfy it with
$\theta = a$. Another approach that has been studied is the 2-norm group bridge (Ma et al., 2011a).
Under certain conditions, its selection consistency is established in Ma et al. (2011a). Note
that the bridge penalty does not satisfy Condition 3 and needs to be separately investigated.
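The role of Condition 3 can be checked numerically from the penalty derivatives. The sketch below codes the SCAD and MCP derivatives in their standard forms and shows that both vanish once $t \ge a\lambda$, that is, $\theta = a$. The function names are hypothetical, and the default values of `a` are illustrative assumptions (3.7 is a commonly used SCAD default), not choices stated in this article.

```python
import numpy as np

def scad_derivative(t, lam, a=3.7):
    """SCAD derivative for t >= 0: lam on [0, lam], linear decay to 0 at a * lam."""
    t = np.asarray(t, dtype=float)
    decay = np.maximum(a * lam - t, 0.0) / ((a - 1.0) * lam)
    return lam * ((t <= lam) + decay * (t > lam))

def mcp_derivative(t, lam, a=3.0):
    """MCP derivative for t >= 0: lam * (1 - t / (a * lam))_+."""
    t = np.asarray(t, dtype=float)
    return lam * np.maximum(1.0 - t / (a * lam), 0.0)

def lasso_derivative(t, lam):
    """LASSO derivative is constant in t, so it never reaches zero (theta is infinite)."""
    return lam * np.ones_like(np.asarray(t, dtype=float))
```

Large coefficients therefore receive no shrinkage under SCAD or MCP, whereas LASSO keeps shrinking them at the constant rate $\lambda$, which is the source of its larger bias.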


Consider the properties of concave 2-norm group penalization. Define the oracle estimator
as with = ps and = 0, where 




arg min^g 





\T 






m=l 


( 6 ) 


Theorem 2 Under Conditions 1-3, consider the estimator defined by minimizing (5).
1. For any Rm < \ ^! have 



M 

Pr| \\I3'^ - prW < \l—Rra, m=l,..-,M| >l-^exp 

m=l 


-R 




m\2 




min||P*l |2 

2. Suppose X < ^ 


min IIP* IP 


29 


M 


1 - ^ exp 


and R}^ < vW' probability at least 

n'^Px{0+) 


|g|(g)C« 


M 


m=l 


Ffmrr2 
m 


^oracle ^ local minimizcr of ^ 


2p exp 


m=l 


2,Tlm-^m^rni.^ “h V^m) 


Theorem 2 can be used to derive the following asymptotic result. 

Corollary 2 Suppose that for m = 1, • • • , M, pff, plf- and are bounded away from zero 
and infinity, n/um = 0(1), \S\ n, logp = 0{n°‘) with a < 1, and = 0(n“i) with 

min 11/3^ II 

Oi G [0,1/2). Under Condition 1-3, if X < — and X S> '""C then the concave 2- 

norm group selection can identify the true sparsity structure and — (3^*\\2 = Opi\/^)- 

Remark 3 When the concave penalty is used, the upper bound of $\psi_m$ can grow to $\infty$ at rate
$O(n^{\alpha_1})$. In contrast, the group LASSO needs the ir-representable conditions. Moreover, the
group LASSO yields a larger bias than the concave 2-norm group selection.


4 Integrative analysis under the heterogeneity model 


Under this model, two-level selection is needed and can be achieved using composite penalization
and sparse group penalization. Properties of composite penalization have been
studied in single-dataset analysis, however, under much simpler data and model settings. For
sparse group penalization built on concave penalties, properties have not been established
for single-dataset analysis.

Define the oracle estimator $\hat\beta = \{\hat\beta_A, 0\}$ where


M 


8 ^ = argmin^^ .1 _ V" - 


(7) 


m=l 
2 


Define pi” = W'm'JfS.}. PP = } and 

Theorem 3 Consider the estimator defined in (7). Under Conditions 1-3, we have

M 


pMife-/3ni2< 






-Cm, m = 1, • • • , M > > 1 - ^ exp -U, 


m=l 




with Cm < 



Corollary 3 Suppose that for m = 1, • • • , M, p*™ and are bounded away from zero and 

infinity, n/um = 0(1), and [S'] C n. Under Condition 1-3, —/3^*||2 = for 

m = 1, • • • , M. 


4.1 Composite penalization 

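Consider the objective function

$$L(\beta) = \frac{1}{2n} \sum_{m=1}^{M} (Y^m - X^m\beta^m)^\top W_m (Y^m - X^m\beta^m) + \sum_{j=1}^{p} p_{O,\lambda_O}\left( \sum_{m=1}^{M} p_{I,\lambda_I}(|\beta_j^m|) \right), \qquad (8)$$

where the outer penalty $p_{O,\lambda_O}(\cdot)$ determines the overall importance of a variable, and the
inner penalty $p_{I,\lambda_I}(\cdot)$ determines its individual importance. $\lambda_O$ and $\lambda_I$ are tuning parameters.
A specific example is the composite MCP (cMCP) where both $p_{O,\lambda_O}$ and $p_{I,\lambda_I}$ are MCP.

[Condition 4] $\theta_O = \inf\{t/\lambda_O : \lambda_O^{-1}p'_{O,\lambda_O}(t) = 0, \ t \ge 0\}$ and $\theta_I = \inf\{t/\lambda_I : \lambda_I^{-1}p'_{I,\lambda_I}(t) = 0, \ t \ge 0\}$
are bounded.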
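To make the nesting of the two penalties concrete, the following sketch evaluates a composite MCP penalty on a $p \times M$ coefficient matrix. It assumes the standard closed form of MCP and treats the outer and inner regularization parameters `a_outer` and `a_inner` as free inputs, which is a simplification; the function names are hypothetical.

```python
import numpy as np

def mcp(t, lam, a):
    """MCP value for t >= 0: lam*t - t^2/(2a) up to t = a*lam, then constant a*lam^2/2."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= a * lam, lam * t - t ** 2 / (2.0 * a), 0.5 * a * lam ** 2)

def composite_mcp_penalty(beta, lam_outer, lam_inner, a_outer, a_inner):
    """Outer MCP applied to the per-variable sum of inner MCP values."""
    beta = np.asarray(beta, dtype=float)
    # Inner penalty: individual importance of variable j in dataset m.
    inner_sum = mcp(np.abs(beta), lam_inner, a_inner).sum(axis=1)
    # Outer penalty: overall importance of variable j across the M datasets.
    return float(mcp(inner_sum, lam_outer, a_outer).sum())
```

Because the inner penalty acts on each $|\beta_j^m|$ separately, a variable can be selected in some datasets and not in others, which is the second level of selection.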


M 


Denote J '^ = max ^ ^ /(/3* 7 ^ 0), j G S' — S'm ^ and /™'“ = maXiP/^Aj(^)- 

i^m 


Theorem 4 Consider the minimizer of (8). Assume Conditions 1-2 and 4. Set


Cl < 


min \B‘C*\ 
ij,m)eA ^ 


_ min 1/5, 

Um, , {j,m)eA 

, Ai < ■ 


m* I 

j 


\s, 


29i 


Ao0o>/l"“max(J-™). 


Then jS is a local minimizer with probability at least 1 — r^, where 


M 

^2 = 

m=l 


^t2 

irv) 


l^rnKp; 


*m\2 


M 


+ 21 S 15 ; 


exp 


m=l 


nyhA^+)p'oAo^J-'"fTn 


2n^p1pal{l + fj. 


* ]2 
m) 


2n™A^cT^(l+i5;;)2 j 


This theorem establishes the consistency of composite penalized estimates. A simplified 
statement is provided in the following corollary. 

Corollary 4 Suppose that for m = 1, • • • , M, and are bounded away from 

zero and infinity, n/um = 0(1), [S'! <C n, \ogp = OijC) with a < 1, and fjf, = 0(n“i) 

min 

with ai G [0,1/2). Under Condition 1,2 and f, if Xi < ' Xo9o = and 

Qt 

A/Ao ^ '■"C composite penalization can achieve the two-level selection consistency. 

Remark 4 Liu et al. (201^) also suggests the composition of MCP and LASSO. We conjecture
that it is estimation consistent, can consistently identify the overall importance of
variables, but in general is not consistent at the individual level.


4.2 Sparse group penalization 

Consider the objective function

$$L(\beta) = \frac{1}{2n} \sum_{m=1}^{M} (Y^m - X^m\beta^m)^\top W_m (Y^m - X^m\beta^m) + \sum_{j=1}^{p} p_{1,\lambda_1}(\|\beta_j\|_2) + \sum_{j=1}^{p} \sum_{m=1}^{M} p_{2,\lambda_2}(|\beta_j^m|). \qquad (9)$$

$\lambda_1$ and $\lambda_2$ are tuning parameters. Here the penalty is the sum of group and individual
penalties. The first penalty determines the overall importance of a variable, and the second
penalty determines its individual importance.

Consider penalties $p_{1,\lambda_1}$ and $p_{2,\lambda_2}$ that satisfy Conditions 2 and 4 with bounded constants
$\theta_1$ and $\theta_2$. Consider the estimator defined by minimizing (9).
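The additive structure of the sparse group penalty can be sketched in the same way. The example below takes both the group and the individual penalty to be MCP (as in SGMCP) with a shared regularization parameter `a`, which is an illustrative assumption; the function names are hypothetical.

```python
import numpy as np

def mcp(t, lam, a):
    """MCP value for t >= 0: lam*t - t^2/(2a) up to t = a*lam, then constant a*lam^2/2."""
    t = np.asarray(t, dtype=float)
    return np.where(t <= a * lam, lam * t - t ** 2 / (2.0 * a), 0.5 * a * lam ** 2)

def sparse_group_mcp_penalty(beta, lam1, lam2, a=3.0):
    """Group MCP on the row norms plus individual MCP on every entry."""
    beta = np.asarray(beta, dtype=float)
    group_part = mcp(np.linalg.norm(beta, axis=1), lam1, a).sum()   # overall importance
    individual_part = mcp(np.abs(beta), lam2, a).sum()              # per-dataset importance
    return float(group_part + individual_part)
```

Unlike the composite penalty, the two levels enter as a sum rather than a composition, so each level has its own tuning parameter acting directly on the coefficients.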



















Theorem 5 Suppose that Condition 1-2 and 4 hold. Set 
Then $\hat\beta$ is a local minimizer with probability at least 1 − where
X^exp
n\2
2n^p^”^cT^(l + J


M 


+2(p-|S|)5^exp 


m=l 


^^ Kai ( 0 +) + P 2 , A 2 ( 0+)]^ 1 

2nn,k^al^{l + J 


That is, the sparse group penalization also enjoys the consistency properties. For theoretical
purpose, $p_{1,\lambda_1}$ and $p_{2,\lambda_2}$ do not need to take the same form. However, using the same $p_{1,\lambda_1}$
and $p_{2,\lambda_2}$ may facilitate computation. We then derive the following asymptotic result.

Corollary 5 Suppose that for m = 1, • • • , M, and are bounded away from 

zero and infinity, n/nm = 0(1), IFI <C n, logp = OijC) with a <1, and with 

min||/3i*||2 ,min \l3J^*\ 

ai G [0,1/2). Under Condition 1-2 and 4, if < ^^^ 201 —’ "^2 < ^^’"'^202 -’ ^ n“ 2 +"i 

Ot 1 

and Ai + A 2 3> then the sparse group penalization achieves the two-level selection 

consistency. 


5 Numerical study 


5.1 Computation 

With the weighted LS approach, the loss function (2) has a least squares form. In single-dataset
analysis with a LS loss, multiple computational algorithms have been developed for
group penalization, composite penalization, and sparse group penalization (Friedman et al.,
2010; Breheny and Huang, 2009; Liu et al., 2014). Here we adopt the existing gradient descent
algorithms with minor modifications. Convergence properties can be derived following
Breheny and Huang (2011) and references therein. Details are omitted here. The penalization
methods involve the tuning parameter $\lambda$ ($\lambda_I$, $\lambda_O$, $\lambda_1$, $\lambda_2$). The theorems provide results
on the asymptotic order. MCP also involves the additional regularization parameter $a$. Following
the literature, we consider a small number of values for $a$, in particular including 1.8,
3, 6 and 10. In the numerical study, we use 5-fold cross validation for tuning parameter selection.
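A possible way to organize the 5-fold cross validation with multiple datasets is sketched below. The article does not state how folds are formed, so assigning folds within each study (so that every training split contains subjects from all M studies) is an assumption, and the names `within_study_folds` and `A_GRID` are hypothetical.

```python
import numpy as np

# Values of the MCP regularization parameter a considered in the text.
A_GRID = (1.8, 3.0, 6.0, 10.0)

def within_study_folds(sample_sizes, k=5, seed=0):
    """Return one array of fold labels (0..k-1) per study, balanced within each study."""
    rng = np.random.default_rng(seed)
    folds = []
    for n_m in sample_sizes:
        labels = np.arange(n_m) % k          # fold sizes differ by at most one
        folds.append(rng.permutation(labels))
    return folds

def train_test_masks(fold_labels, held_out):
    """Boolean training and testing masks for one study and one held-out fold."""
    test = fold_labels == held_out
    return ~test, test
```

For each candidate tuning value and each `a` in `A_GRID`, the model would be fitted on the training masks of all studies jointly and scored on the held-out fold; the scoring criterion itself is not specified here.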


5.2 Simulation 

We simulate three datasets, each with 100 subjects. For each subject, we simulate 1,000
covariates. The covariates have a joint normal distribution, with marginal means equal
to zero and variances equal to one. Consider two correlation structures. The first is the
auto-regressive (AR) correlation, where covariates $j$ and $k$ have correlation coefficient
$\rho^{|j-k|}$ with $\rho$ = 0.2, 0.5, and 0.8, corresponding to weak, moderate, and strong correlations, respectively.
The second is the banded correlation. Here three scenarios are considered. Under the first
scenario, covariates $j$ and $k$ have correlation coefficient 0.3 if $|j - k| = 1$ and 0 otherwise.
Under the second scenario, covariates $j$ and $k$ have correlation coefficient 0.6 if $|j - k| = 1$, 0.3
if $|j - k| = 2$, and 0 otherwise. Under the third scenario, covariates $j$ and $k$ have correlation
coefficient 0.6 if $|j - k| = 1$, 0.3 if $|j - k| = 2$, 0.15 if $|j - k| = 3$, and 0 otherwise. Both the
homogeneity and heterogeneity models are simulated. Under the homogeneity model, all three
datasets share the same twenty important covariates. Under the heterogeneity model, each
dataset has twenty important covariates. The three datasets share ten important covariates
in common, and the remaining important covariates are dataset-specific. Under both models, there
are a total of sixty true positives. The nonzero coefficients are randomly generated from a
normal distribution with mean zero and variance 0.3125 and 1.25, representing low and high
signal levels. The log event times are generated from the AFT models with intercept equal
to 0.5 and N(0,1) random errors. The log censoring times are independently generated from
uniform distributions. The overall censoring rate is about 30%.
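The data-generating design can be sketched as follows. The sketch assumes the AR correlation takes the usual form $\rho^{|j-k|}$; the bounds of the uniform censoring distribution are not given in the text, so `censor_low` and `censor_high` are hypothetical inputs that would have to be tuned to reach a censoring rate of about 30%. All function names are hypothetical.

```python
import numpy as np

def ar_correlation(p, rho):
    """AR structure: corr(j, k) = rho ** |j - k|."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def banded_correlation(p, bands):
    """bands[d - 1] is the correlation at lag d; lags beyond len(bands) get 0."""
    idx = np.arange(p)
    lag = np.abs(idx[:, None] - idx[None, :])
    corr = np.eye(p)
    for d, value in enumerate(bands, start=1):
        corr[lag == d] = value
    return corr

def draw_coefficients(p, important, variance, rng):
    """Nonzero coefficients ~ N(0, variance) on the important indices, zero elsewhere."""
    beta = np.zeros(p)
    beta[important] = rng.normal(0.0, np.sqrt(variance), size=len(important))
    return beta

def simulate_dataset(n, corr, beta, rng, censor_low, censor_high, intercept=0.5):
    """One study: covariates, observed log times, and event indicators."""
    chol = np.linalg.cholesky(corr)
    X = rng.standard_normal((n, corr.shape[0])) @ chol.T    # rows ~ N(0, corr)
    log_t = intercept + X @ beta + rng.standard_normal(n)   # AFT model, N(0,1) errors
    log_c = rng.uniform(censor_low, censor_high, size=n)    # log censoring times
    y = np.minimum(log_t, log_c)
    delta = (log_t <= log_c).astype(int)
    return X, y, delta
```

The three banded scenarios correspond to `bands=[0.3]`, `bands=[0.6, 0.3]`, and `bands=[0.6, 0.3, 0.15]`, and the low and high signal levels to `variance=0.3125` and `variance=1.25`.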

The simulated data are analyzed using group MCP (GMCP), composite MCP (cMCP),
and sparse group MCP (SGMCP). In addition, we also consider two alternatives. The first
is a meta-analysis method, where each dataset is analyzed separately using MCP, and then
the analysis results are combined across datasets. The second is a pooled analysis method,
where the three datasets are combined into a big data matrix, and then variable selection is
conducted using MCP. Note that the differences across simulated datasets are smaller than
those encountered in practice, which favors meta- and pooled analysis. We acknowledge that
multiple other methods are applicable to the simulated data. The two alternatives have the
framework closest to that of the proposed methods.

Summary results based on 200 replicates are shown in Tables 1 and 2. Performance of
the integrative analysis methods as well as the alternatives depends on the similarity of sparsity
structures across datasets, correlation structure, and signal level. As an example of
the homogeneity model, consider the correlation structure “Banded 2” in Table 1. The
homogeneity model favors GMCP, which identifies 34.7 true positives with an average model
size 45.2. The cMCP method identifies fewer true positives (30.5). A large number of false
positives are identified, with an average model size 149.7. SGMCP identifies 25.6 true positives,
with a very small number of false positives (average model size 27.4). In comparison,
the meta-analysis and pooled analysis identify much fewer true positives (17.6 and 16.1,
respectively). As an example of the heterogeneity model, consider the correlation structure
“AR ρ = 0.5” in Table 2. The cMCP method identifies the most true positives (42.1 on
average), but at the price of a large number of false positives (average model size 185.1).
GMCP identifies 34.6 true positives. However, by forcing the same sparsity structure across
datasets, it also identifies a considerable number of false positives (average model size 61.0).
SGMCP identifies 26.9 true positives with an average model size 30.2. The meta-analysis
and pooled analysis methods identify fewer true positives.

5.3 Analysis of lung cancer prognosis data


In the U.S., lung cancer is the most common cause of cancer death for both men and
women. To identify genetic markers associated with the prognosis of lung cancer, gene
profiling studies have been extensively conducted. We follow Xie et al. (2011) and collect
data from four independent studies with gene expression measurements. The UM (University
of Michigan Cancer Center) dataset has a total of 92 patients, with 48 deaths during follow-up.
The median follow-up is 55 months. The HLM (Moffitt Cancer Center) dataset has a
total of 79 patients, with 60 deaths during follow-up. The median follow-up is 39 months.
The DFCI (Dana-Farber Cancer Institute) dataset has a total of 78 patients, with 35 deaths
during follow-up. The median follow-up is 51 months. The MSKCC dataset has a total of
102 patients, with 38 deaths during follow-up. The median follow-up is 43.5 months.

Gene expressions were measured using Affymetrix U133 plus 2.0 arrays. A total of 22,283
probe sets were profiled in all four datasets. We first conduct gene expression normalization
for each dataset separately, and then normalization across datasets is also conducted to enhance
comparability. To further remove noise and improve stability, we conduct a marginal
screening and keep the top 2,000 genes for downstream analysis. The expression of each gene
in each dataset is normalized to have zero mean and unit variance.

We analyze data using cMCP (Table 3), SGMCP (Table S2.1), meta-analysis (Table
S2.2), pooled analysis (Table S2.3), and GMCP (Table S2.4). Although there is overlap,
different methods identify significantly different sets of genes. The cMCP method identifies
more genes, particularly many more than SGMCP. Such a result fits the pattern observed
in simulation. Unlike in simulation, we are not able to objectively evaluate the marker
selection results. To provide further insights, we evaluate prediction performance using a
cross-validation based approach. Specifically, we split the samples into a training and a
testing set with sizes in the ratio 3:1. Estimates are generated using the training set samples and used to
make prediction for the testing set samples. We separate the testing set samples into two sets
with equal sizes based on the predicted linear risk scores $X^\top\hat\beta$. The logrank statistic is computed,
evaluating survival difference of the two sets. To reduce the risk of an extreme split, we repeat this process
100 times and compute the average logrank statistics as 7.65 (cMCP), 4.95 (SGMCP), 5.35
(meta-analysis), 5.2 (pooled analysis), and 6.45 (GMCP). All methods are able to separate
samples into sets with different survival risk. The cMCP method has the best prediction
performance (p-value 0.0057).
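The evaluation step can be sketched with a plain two-sample logrank statistic. This is a minimal sketch of the standard test, not the authors' code, and the function names are hypothetical. It also shows that a statistic of 7.65 referred to a chi-squared distribution with one degree of freedom gives a p-value close to the reported 0.0057.

```python
import math
import numpy as np

def logrank_statistic(time, event, group):
    """Two-sample logrank chi-squared statistic; group is a 0/1 indicator."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    group = np.asarray(group, dtype=bool)
    obs_minus_exp = 0.0
    variance = 0.0
    for t in np.unique(time[event]):
        at_risk = time >= t
        n = at_risk.sum()                         # subjects at risk just before t
        n1 = (at_risk & group).sum()              # of which in group 1
        d = (event & (time == t)).sum()           # events at t
        d1 = (event & (time == t) & group).sum()  # of which in group 1
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            # Hypergeometric variance of d1 given the margins.
            variance += d * (n1 / n) * (1.0 - n1 / n) * (n - d) / (n - 1.0)
    return float(obs_minus_exp ** 2 / variance) if variance > 0 else 0.0

def chi2_1df_pvalue(stat):
    """Upper tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))
```

In the procedure described above, `group` would indicate whether a testing-set subject falls in the upper or lower half of the predicted scores, and the statistic would be averaged over the 100 random splits.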


6 Discussion 

In this article, we have studied the integrative analysis of survival data under the AFT
model. The existing research on this topic has been scattered, and this study is the first
to systematically study this complicated problem. Both the homogeneity and heterogeneity
models have been considered, along with multiple penalization methods. Significantly
advancing from the existing studies, the present study rigorously establishes the selection
and estimation consistency properties. Although some theoretical development has been
motivated by the existing studies, the heterogeneity across multiple datasets and specific
data and model settings make this study unique. In particular, the properties of sparse group
penalization have not been studied in single-dataset analysis. Thus this study has both
methodological and theoretical contributions. The computational aspect is similar to that
in the literature and is largely omitted. Tuning parameter selection using cross validation
shows reasonable performance in simulation and data analysis. Theoretical investigation on
the consistency of cross validation is very challenging and is postponed. Another contribution
is that this study directly compares different methods. The advantage of GMCP
under the homogeneity model is expected. Under the heterogeneity model, cMCP may identify
a few more true positives, however, at the price of a large number of false positives.
The theoretical study does not provide an explanation for this observation. More studies on
finite sample properties are needed. In simulation, a total of 24 settings are considered and
show similar patterns. More extensive simulations may be pursued in the future. In data
analysis, different methods identify different sets of genes. The observed patterns are similar
to those in simulation. In addition, cMCP identifies the most genes but also has the best
prediction performance. More extensive studies, especially biological ones, may be needed to fully
comprehend the data analysis results. In this study, we have focused on survival data and
the AFT model. Extensions to other data and models are of interest for future study.

Appendix

This file contains proofs (Section S1) for the theoretical results described in the main
text as well as additional numerical results (Section S2).

S1 Proofs


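Let $\tilde y^m = W_m^{1/2} Y^m$ and $\tilde X^m = W_m^{1/2} X^m$.
Then $(Y^m - X^m\beta^m)^\top W_m (Y^m - X^m\beta^m)$ can be rewritten as $\|\tilde y^m - \tilde X^m\beta^m\|_2^2$, where
$\|\cdot\|_2$ is the $\ell_2$ norm. Moreover, we can easily see that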


ym ^ ^ 

Proof of Theorem 1. First, we prove that 

4 n 


(S1.2) 


M 


where Ti = ^ exp 

m=l 


Pr <; 11/3- - (3Th < — 1, • • ■ > 1 - n. 

Recall that /3 b = argmin^gL(/3B), where 




2o-^ 


mP2^m 


M 


^ E lly” - 11/3,112. 


m=l 


j&s 


Let Tm = and 3 = {/3 b ; ||/3— — /3 —*||2 = m = 1, • • • , M}. It snffices to show 

that 

Pr ( inf L(/3 b) > L(/3^)) > 1 - n. 

\PbGJ / 

This implies that, with probability at least $1 - \tau_1$, $L(\beta_B)$ has a local minimum $\hat\beta_B$ that satisfies
$\|\hat\beta_S^m - \beta_S^{m*}\|_2 < r_m$ for $m = 1, \ldots, M$.

Let $u^m \in \mathbb{R}^{|S|}$ with $\|u^m\|_2 = 1$, $m = 1, \ldots, M$. Define $\beta_S^m = \beta_S^{m*} + r_m u^m$. Consider
$Q(u_B) = n\{L(\beta_B) - L(\beta_B^*)\}$. Obviously, it is equivalent to show that

$$\Pr\left( \inf_{\|u^m\|_2 = 1, \ m = 1, \ldots, M} Q(u_B) > 0 \right) > 1 - \tau_1. \qquad (S1.3)$$

Table 1: Simulation at the low signal level. In each cell, the first row is the number of true
positives (sd), and the second row is the number of model size (sd).

Correlation   Meta          Pooled        GMCP          cMCP           SGMCP

Homogeneity model
AR ρ = 0.2    30.3(5.7)     29.0(8.4)     48.8(6.2)     42.6(4.2)      36.5(6.7)
              62.4(19.1)    56.5(29.3)    57.4(9.4)     193.2(13.9)    39.1(8.3)
AR ρ = 0.5    20.4(6.0)     18.3(6.7)     39.5(7.9)     33.3(8.1)      28.6(6.9)
              38.7(17.7)    31.2(16.3)    50.8(12.4)    160.6(83.0)    30.9(9.1)
AR ρ = 0.8    10.9(2.6)     10.3(3.3)     24.8(7.7)     18.3(4.1)      16.8(5.2)
              17.9(6.1)     15.5(6.4)     34.4(12.8)    75.4(59.2)     18.6(7.2)
Banded 1      26.7(5.8)     25.1(7.6)     46.2(7.6)     40.3(4.5)      34.7(6.2)
              54.3(18.7)    48.7(26.1)    56.5(12.6)    196.6(12.7)    37.8(8.9)
Banded 2      17.6(4.5)     16.1(5.0)     34.7(8.3)     30.5(6.0)      25.6(5.9)
              30.4(11.6)    25.4(12.5)    45.2(13.7)    149.7(95.0)    27.4(7.2)
Banded 3      17.7(5.3)     16.2(4.9)     37.3(7.3)     31.4(5.8)      26.1(6.3)
              32.1(18.6)    26.8(12.9)    51.1(13.7)    166.3(81.7)    28.2(7.6)

Heterogeneity model
AR ρ = 0.2    21.3(5.1)     20.2(5.7)     26.0(9.0)     37.6(5.2)      22.5(7.2)
              35.5(13.8)    31.4(13.9)    53.0(20.3)    199.2(40.3)    28.4(11.0)
AR ρ = 0.5    16.8(5.1)     16.7(5.3)     22.8(6.2)     31.7(6.9)      18.8(5.7)
              28.5(10.8)    27.3(12.0)    45.5(15.2)    154.8(94.4)    21.9(7.7)
AR ρ = 0.8    10.6(3.8)     10.3(3.5)     15.2(5.5)     20.0(4.9)      11.9(4.2)
              17.0(6.3)     15.3(6.3)     31.4(12.9)    99.9(84.4)     15.3(6.8)
Banded 1      20.4(4.8)     19.9(6.0)     25.2(6.7)     35.3(6.7)      20.9(6.0)
              35.2(15.2)    31.3(13.9)    48.9(14.5)    172.2(77.9)    24.9(7.9)
Banded 2      16.1(4.0)     15.1(3.9)     21.4(6.1)     28.0(5.4)      17.5(4.8)
              24.9(8.4)     22.8(7.7)     44.0(12.2)    129.9(103.4)   21.0(6.2)
Banded 3      15.9(3.6)     15.2(4.4)     20.2(6.0)     27.1(6.2)      17.8(4.9)
              26.8(10.8)    24.3(10.2)    43.3(14.2)    102.7(115.7)   22.3(7.5)

Table 2: Simulation at the high signal level. In each cell, the first row is the number of true
positives (sd), and the second row is the number of model size (sd).

Correlation   Meta          Pooled        GMCP          cMCP           SGMCP

Homogeneity model
AR ρ = 0.2    39.4(4.5)     39.2(5.4)     58.3(2.3)     52.3(2.9)      49.9(3.8)
              49.9(9.3)     48.8(11.4)    60.1(4.2)     174.6(11.6)    50.1(4.1)
AR ρ = 0.5    30.1(5.0)     30.0(6.0)     55.4(3.6)     46.5(3.3)      44.2(4.0)
              42.0(10.1)    41.8(12.3)    58.3(4.3)     179.8(15.3)    44.5(4.2)
AR ρ = 0.8    17.4(3.8)     17.1(3.9)     46.5(6.7)     29.5(6.4)      29.6(5.9)
              24.2(6.5)     23.6(7.3)     54.1(10.6)    103.8(97.8)    30.7(6.1)
Banded 1      36.9(4.7)     35.9(5.1)     57.2(2.7)     50.3(2.9)      47.9(4.3)
              47.3(8.4)     43.7(7.6)     58.7(4.4)     178.4(12.1)    48.3(4.4)
Banded 2      25.9(4.3)     25.5(4.7)     53.3(4.6)     41.1(3.4)      38.6(5.5)
              36.3(8.8)     34.4(9.1)     57.8(8.3)     186.2(16.7)    39.7(6.1)
Banded 3      27.1(3.8)     26.5(4.3)     53.7(4.5)     42.4(4.4)      40.8(4.7)
              37.3(8.4)     35.8(8.3)     57.8(7.0)     179.8(21.4)    42.0(5.6)

Heterogeneity model
AR ρ = 0.2    34.4(4.1)     34.0(4.1)     40.0(4.2)     48.91(3.2)     33.9(4.6)
              39.7(6.0)     37.9(4.8)     69.2(7.9)     180.4(18.9)    36.6(4.7)
AR ρ = 0.5    25.9(4.5)     24.1(5.9)     34.6(5.7)     42.1(4.1)      26.9(4.8)
              32.7(6.6)     29.5(7.3)     61.0(9.8)     185.1(18.0)    30.2(6.2)
AR ρ = 0.8    16.4(3.4)     15.6(3.5)     23.7(5.6)     26.8(5.3)      17.5(4.4)
              22.2(5.1)     21.3(6.5)     44.3(10.3)    157.5(87.3)    20.9(5.6)
Banded 1      30.8(4.1)     30.2(4.6)     36.8(5.3)     45.8(3.1)      30.0(5.2)
              36.0(5.8)     35.4(6.7)     64.1(9.3)     177.7(17.3)    32.6(6.7)
Banded 2      22.9(4.6)     22.4(4.1)     32.1(5.9)     36.6(4.3)      25.2(4.9)
              29.3(7.8)     27.5(5.4)     57.4(8.4)     169.2(51.2)    28.6(5.3)
Banded 3      23.0(4.6)     22.6(4.2)     31.6(6.2)     37.4(5.0)      24.2(6.8)
              28.7(5.8)     27.9(5.3)     57.1(9.9)     169.2(42.1)    26.6(7.5)
Table 3: Analysis of lung cancer data using cMCP: identified genes and their estimates.

Probe         Gene       UM        HLM       DFCI      MSKCC
201462_at     SCRN1                0.0045
202637_s_at   ICAM1                                    0.0037
203240_at     FCGBP                0.0024
203876_s_at   MMP11                                    -0.0013
203917_at     CXADR                                    0.0040
203921_at     CHST2      0.0024
204855_at     SERPINB5   -0.0008
205234_at     SLC16A4                                  -0.0016
205399_at     DCLK1                                    -0.0031
206461_x_at   MT1H                           -0.0008
206754_s_at   CYP2B6                         0.0048
206994_at     CST4                 -0.0017
207850_at     CXCL3                -0.0155
208025_s_at   HMGA2                          -0.0016
208451_s_at   C4A                  0.0038
208607_s_at   SAA2                                     0.0044
209343_at     EFHD1                                    0.0028
212328_at     LIMCH1                         0.0028
212338_at     MYO1D                0.0019
213338_at     TMEM158                        -0.0003
214452_at     BCAT1                                    0.0004
215867_x_at   CA12                 -0.0054
218677_at     S100A14                                  -0.0081
219654_at     PTPLA                -0.0109
219747_at     NDNF       0.0001
220952_s_at   PLEKHA5              -0.0018
221841_s_at   KLF4       -0.0024
222043_at     CLU                  0.0008





Together with (S1.1) and (S1.2), we have


M 




m=l 


+nA5^{|l/3* + ron,ll2-l|/3*|lJ 


j&s 


M 


M 




m^,m 
S 


m=l 

+nA |||/3* + r 
j&s ^ 

=: Qi + Q2 + Q3, 


o u 


'^'112 


m=l 

1/3*11 
I 11 2 


(S1.4) 


where $r = (r_1, \cdots, r_M)$, and $\circ$ denotes the Hadamard (component-wise) product. Write


M 


Qi = Y 1 Qim where Qim = Wne”'. Note that || WnX^ With 

m=l 

the sub-Gaussian tail as specified in Condition 1, we have for any given $\varepsilon_m$


Pr(|Qim| > rr^Ern) < 2 exp 


2a^||W^X>™||i 

Together with Bonferroni's inequality, we have

MM M 


< 2 exp 


2nmpf(Tl 


m=l 


m=l 


Pr(Qi ^ ^ ^m^m) — ^ P^(Qlm ^ ^ 6Xp 

m=l 

Set Em = \t^nmrm- Then 


2nmp2(^'L 


M 


M 


Pr(Qi >-T Ea 1 - E 


exp 


nmrl,{p 


2 rprn^2 


32p™ct2 


(S1.5) 


For Q 2 , since u'^~^X^~^WmX^u'^ > rimP^, we have 


M 


Q 2 > - r^UmP^. 


(S1.6) 


m=l 


Term $Q_3$ can be dealt with as follows. By the triangle inequality and (X

i=l i=l 












for any sequence n*, we have 


11/3* + r o u,\\^ - \\f3*\\^ < ||r o I. 

j&s jes 


'Jll2 


- = VMa 

V ies 

Therefore, the term $Q_3$ satisfies

IQal < nX^/l^l^r 


M 


M 


\ m=l 


m=l 


M 


m=l 


Combining (S1.4), (S1.5), (S1.6), and (S1.7), we have


M M 

Q{us) > - := L{r) 


m=l 


m=l 


(S1.7) 


(S1.8) 


M 

with probability at least 1 — X] ^xp 

m=l 


1 ^ 32p^al ) ■ 


Recall that 


Then 


M 

L{r) > 0 with probability at least 1 — X] 

m=l 


V 2o-^pJ*n„ 


. Therefore, fISl.Sp is proved, and 


Part 1 of Theorem 1 is established. 

Now consider Part 2. By the Karush-Kuhn-Tucker (KKT) conditions, we need to prove
that for m = 1, • • • , M, 


' ^ II/3b||2 

\\Xl{y--X^^^)\\^<n\. 


(S1.9) 
(SI.10) 


Then ^ai^sso ^ ^^g^asso^ ^g^asso^ ^g^asso ^ ^g^asso ^ Q jg local minimizer of (3). 
From Part 1, (3s minimizes 


L{(3b) 


1 

2n 


m=l 


\y 


-x^o: 


m II2 

S II 


+ Wf^jh- 

j&s 


Therefore, (S1.9) holds, together with (S1.2) which also yields



m* _ 

s — 


( -\rniT -ym\ ^ 

[^S ) 




(Sl.ll) 
Note that


- X^JX^0^ - /3; 

Substituting (S1.11) into (S1.12), we obtain


m* ^ 
S ) 


-'S hloo 


< 


_, ( am 

X'^7Wr^l‘^e^ - x^7x^ {X'fX'^) Xf^Wm^l'^e^ - n\-K^ 

[ Wf^sh 

X^JWj^^e^ + X^JX^ {X^^X^y^ 


+nX 


_, am 

^mT \rm ( \rmT vm\ -*- 


-ymT -\rm ( -\rrriT 

A^c A^ (^A^ A^ j 


< \\xywrney 


\\^b\\2 

xyw^x'^ {xyw^x'^) 


-1 


\xywmey 


+nX 


xywmX^ {xyw^x^) 


-1 




CX) 

1 /3b| 2 


< \\xyWmsyi^+iJm \\xywmeyy+nx^r 

By the condition < Dm < 1, if 

l-D„ 


OO 


IX^^WmD 


< nX 


1 + Dm 


then from (S1.13) it follows

\\Xy^{y^-X^^ 


!^)||oo < WX^^WmS 


^{1 +'ljjm) +nX'ljjr. 
$< n\lambda(1 - D_m) + n\lambda D_m = n\lambda.$


We now derive the probability bounds for the event in (S1.14). By Bonferroni's inequality
and the sub-Gaussian tail probability bound in Condition 1,


Pr|||X™^lW„e™|l >nA^-for m = 1, • • • , M 

I °° 1 + Dm 

I- Dr 


( 1 _ n ') 

< p^Pr |Xf^W^e-|>nA-—^ 
^=1 L i + J 


M 


< 2p exp 

m=l 


n^X\l - Dm)^ 


2n^Ama^(l + Dm)"^ 

Then Part 2 is established by combining Part 1, (S1.9), (S1.10), and (S1.15).


(SI.12) 


(51.13) 

(51.14) 

inequality 

(51.15) 

□ 
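
The probability bounds in the proof above combine a sub-Gaussian tail inequality with Bonferroni's inequality. As a numerical illustration only, the following Python sketch evaluates a bound of this type and compares it with a Monte Carlo estimate. It assumes Gaussian errors with standard deviation `sigma` (a special case of a sub-Gaussian tail) and uses the generic form $2\exp\{-t^2/(2\sigma^2\|a\|_2^2)\}$ for a fixed vector $a$. The function names are hypothetical, and the sketch does not reproduce the weighted design of the paper.

```python
import math
import random


def tail_bound(a, sigma, t):
    """Sub-Gaussian bound 2*exp(-t^2 / (2*sigma^2*||a||_2^2)) on Pr(|a'eps| > t)."""
    norm_sq = sum(x * x for x in a)
    return 2.0 * math.exp(-t * t / (2.0 * sigma * sigma * norm_sq))


def bonferroni_bound(vectors, sigma, t):
    """Union (Bonferroni) bound on Pr(max_k |a_k'eps| > t), capped at 1."""
    return min(1.0, sum(tail_bound(a, sigma, t) for a in vectors))


def empirical_tail(vectors, sigma, t, n_rep=5000, seed=0):
    """Monte Carlo estimate of Pr(max_k |a_k'eps| > t) for Gaussian eps."""
    rng = random.Random(seed)
    n = len(vectors[0])
    hits = 0
    for _ in range(n_rep):
        eps = [rng.gauss(0.0, sigma) for _ in range(n)]
        largest = max(abs(sum(x * e for x, e in zip(a, eps))) for a in vectors)
        if largest > t:
            hits += 1
    return hits / n_rep
```

For any fixed set of vectors, the value returned by `empirical_tail` should not exceed `bonferroni_bound` up to Monte Carlo error; the gap between the two reflects the slack of the union bound.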
Proof of Theorem 2. Recall that (3^ = argmin^^iJ(/3e), where




1 

2n 


E 

m=l 




Let with Rm G (0, cx)) and J = {/3 b : -/3^*||2 = rm,m = 1, - ■ ■ ,M}. 

Similar to the proof of Part 1 of Theorem 1, if we can prove


M 


Pr > H(0i)) > 1 - exp ^ -A 


IsKp;' 


m\2 


m=l 




(SI.16) 


then H{(3i3) has a local minimum /3 b that satishes 11/3™ — /3™*||2 < Tm, m = 1, • • • , M with 
probability at least 1 — X] 

I ®P2 


m=l 


Together with (S1.1) and (S1.2), we have


M 

m=l 


M 


+5 E (/ 3 ? - - 0T 

m=l 

=: /^i + i/2, 


For 3^2, since Amin } = P™ and 11/3^^ - /3^*||2 = we have 


.1 

M 


^2 > - 


m=l 


For FTi we have for any 


(SI.17) 


(SI.18) 


M 


M 


Pr(i7i <-^ rmSm) < ^ exp 


^2 ^2 
m^m 


m=l 


^ 2a^||fFmX™(/3g‘-/3r)|li, 

< ^exp 


m=l 




2nmP^o-^ 


The first inequality holds due to the sub-Gaussian tail probability under Condition 1, and
the last inequality holds due to the fact that ||hFmX™(/3™ — /3™*)||2 < UmP^rl^- Set Em = 
(S1.19)


l/f^UrnTm. Then 


M 


M 


Sp'^al 


Pt{Hi > -t^YI > 1 - ^ exp 


m=l 


m=l 


Recall that = y. Combining (S1.17), (S1.18), and (S1.19), we have that (S1.16) holds.
This completes the proof of Part 1.

Next, we prove Part 2. By the Karush-Kuhn-Tucker (KKT) conditions, we need to prove
that $\hat{\beta}$ satisfies


/3: 


X-T lym _ + Up'^msh) O = 0’ 


ll/^elh 


||X^J(2/--XF/3^)|U<np',(0+). 


If min||/3j||2 > 6'A, Pa(||;9b|| 2) = 0, and certainly (S1.20) holds. Define

jeS 


min 11/3*112 ^ 

Dt < 2^ _ rhui 

2^m Vi^r 

min 11 / 3 * II 2 

Note that A < -——. Therefore, we can conclude the event 


29 


fin 


(51.20) 

(51.21) 


belongs to the event min \\(3j\\2 > ^'A [■. That is. 


Pr 


I mm 11/3,II 2 > 0a| > Pr |^||/3^ - ^Th < ^ = 1, • • 


•,M 


> 1 — exp 

m=l 


-flii 


8<p5* ” 

Now consider the probability of 

||X-T(j/- - < v,(0+), for m = 1 , • • • , M. 


(SI.22) 


(S1.23) 





















Note that 


X™' - X^/3^) = Xp - Xp W^X^{f3^ - f3^ 


3 m*'] 

S ) 


(S1.24) 


Combining (S1.23) and (S1.24), we can obtain

\\X^J{y^-X^$P\P 

X^J Wme^ - X^JWmX^ {X^^WmXp-^ X^^Wme 

X^JW^X^ {X^^Wr^Xp-^ X^^W^e 


< \\XpWrr^epP + 


oo 

m 


< WXpW^ep 


xpw^x^ {Xpw^xp 


-1 


\XpWmep 


< \\XpWmepP + ^l^m\\XpWme^ 

< \\X^^WmeP\{l + i:p. 


If 


ix™ny„.’"ii <"Pa(o+) 


(1 + ^m) 


(SI.25) 


(S1.26) 


then from (S1.25) it follows


l|Jf?J(H’"-.>f?/3?)ll=o < "y( (l + V>„)<np',(0+), 

(1 + Ipm) 

which proves (S1.21). We now derive the probability bounds for the event in (S1.26). In fact,
by Bonferroni's inequality and the sub-Gaussian tail probability bound under Condition 1,


Pr<' ||X”*^iy^e^ 


(1 + 1pm) 

. o ^ f ^Va(0+) 

2n„A„crl(l+i/>„y 




Part 2 is proved by combining (S1.20), (S1.21), (S1.22), and (S1.27).


(SI.27) 

□ 
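
The argument above relies on two features of the penalty derivative: it equals its maximal value at $0+$, and it vanishes once the argument is large enough relative to the tuning parameter. The following Python sketch illustrates both features for the standard minimax concave penalty (MCP). It is a minimal sketch under the assumption that the penalty is the usual MCP with tuning parameter `lam` and regularization parameter `gamma`; the correspondence between `gamma` and the constant multiplying $\lambda$ in the proof is assumed here, and the function names are hypothetical.

```python
def mcp(t, lam, gamma):
    """MCP value: lam * integral over [0, |t|] of (1 - x / (gamma * lam))_+ dx."""
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return 0.5 * gamma * lam * lam   # constant once |t| exceeds gamma * lam


def mcp_derivative(t, lam, gamma):
    """Derivative of the MCP with respect to |t|: lam * (1 - |t| / (gamma * lam))_+."""
    return max(0.0, lam - abs(t) / gamma)
```

With `lam = 0.5` and `gamma = 3`, the derivative equals `lam` at zero and is exactly zero for any argument larger than `gamma * lam = 1.5`, so sufficiently large coefficients are not shrunk.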


Proof of Theorem 3. The proof is similar to that of Part 1 of Theorem 2 and is omitted 
here. □ 
Proof of Theorem 4. By the Karush-Kuhn-Tucker (KKT) conditions, we need to prove
that $\hat{\beta}$ satisfies


M 




V£./3S„)+"Po,AyEP'.^<(fel))°PU,(fel) = 0. (Sl'28) 


m=l 


M 


lx?-sjy"' - AX/3?JI < "Pd,(0+)P'o.A„(E P'.a,(I/3?-sJ)). 

m=l 

\\X^7{y- - XZ^lJWoo < Vo,Ao(0+Ka,(0+). 


(51.29) 

(51.30) 


If min |/3™| > 0/A/, then P/^Ajd/^sLl) ^ R-^call the dehnition of the estimator We 

j^Srr ’ 


can easily get (S1.28). Set


min \j3 
^ U,m)eA 


m* I 


Ur, 




min |/3J**| 

Note that A/ < -. Therefore, 


(i,m)e.4 
M 


Pr , min ^ |/3-| > 0/A/ > Pr ||/3^^ - < 


|.S„ 


Ur, 


-CL, m = 1, ■ ■ ■ ,M 


> l-2^exp -C£ 


m=l 


Spral 


.*mA2 


(S1.31) 


In fact. 


XfJs^{y--XZm 

— U/ U/ Y'‘ 

— y\. g-STTiVV ^ 5_5'm KKmyv^ 

and — f3 

^X^JWme^. Then we have 


(y^-X^ 

/3L)I 

< 

+1 


< 

iX^Jsrr^Wme^l + 

(xgJwA^xg,)- 

< 


< 

l!^S‘^‘r„£'”|L(i+C)- 

Hence (S1.29) holds when







< 


np'i,x,{0+)p'o,Xo{J-^frn 
(1 + Vm) 


(SI.32) 


(S1.33) 
That is because for $m = 1, \cdots, M$,


f^s )l < 


«P',,A,(0 + )P'o,Ao(-^“’”/r“) 


(1 + ’I’m) 


(l + € 


< «P',,A,(')+)P0,A„(^“"Ya'”“)(1 + ’I’m) 

/ M 

< np;A,(0+yo,Ao 5 ^P/.Ap(I^S-5,„I) 


v^rn=l 


We now derive the probability bounds for the event in (S1.33). In fact, by Bonferroni's
inequality and the sub-Gaussian tail probability bound in Condition 1,


T „ np'r. (o+yoA^(j""*/r“) 

Pr <! |lX™^IWne™|L > ’ ! : m. - - —3 m e {1, • • • , M} 


(1 + 


M 


< 


2|S|E 


exp 


^V7 a,(o+)p'(5,ao('^ ""fi 


j—m ^max^ 


Similarly, we can prove (S1.30). Actually,

mT T 


(S1.34) 


{y- - XZf3Z)\oo < 11 ^ 5 ^' 




< 11X^7+ 7w!^Vo,Ao(0+Kap(0+)- 


(1 + 7;; 


Based on the above discussions, (1S1.30p follows if ||X™7lP; 
probability bound is derived as 


rnTri/ ,m|| ^ Vq.Aq(0+H.a7°+) 


(1+hh) 


Pr<! ||X^7lWne”^||^ > 3 m e {I,-- - ,M} 


(1 + 7;; 


M 


< 2(p-|SI)Wexp^-!Ah+±M+A)\ 


m=l 


(SI.35) 


Therefore, the theorem is proved by combining (S1.28), (S1.29), (S1.30), (S1.31), (S1.34), and
(S1.35). □
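
The conditions in the proof above involve the product of an outer penalty derivative, evaluated at the sum of inner penalties, and an inner penalty derivative, which is the chain rule applied to a composite penalty. The following Python sketch illustrates this product for a single covariate with one coefficient per dataset. It is a hedged sketch, assuming that both the outer and the inner penalty are standard MCPs; the parameter names (`lam_o`, `gam_o`, `lam_i`, `gam_i`) and function names are hypothetical and are not taken from the paper.

```python
def mcp(t, lam, gamma):
    """Standard MCP value (assumed form of both outer and inner penalties)."""
    a = abs(t)
    if a <= gamma * lam:
        return lam * a - a * a / (2.0 * gamma)
    return 0.5 * gamma * lam * lam


def mcp_derivative(t, lam, gamma):
    """Derivative of the MCP with respect to |t|."""
    return max(0.0, lam - abs(t) / gamma)


def cmcp_penalty(beta_j, lam_o, gam_o, lam_i, gam_i):
    """Composite penalty for one covariate: outer MCP of the summed inner MCPs."""
    inner_sum = sum(mcp(b, lam_i, gam_i) for b in beta_j)
    return mcp(inner_sum, lam_o, gam_o)


def cmcp_weights(beta_j, lam_o, gam_o, lam_i, gam_i):
    """Chain-rule weights outer'(sum of inner) * inner'(|b|), one per dataset."""
    inner_sum = sum(mcp(b, lam_i, gam_i) for b in beta_j)
    outer_d = mcp_derivative(inner_sum, lam_o, gam_o)
    return [outer_d * mcp_derivative(b, lam_i, gam_i) for b in beta_j]
```

When all coefficients of a covariate are zero, each weight equals the product of the two derivatives at $0+$, which is the threshold appearing on the right-hand side of (S1.30); when every coefficient is large, the inner derivatives and hence the weights are zero.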


Proof of Theorem 5. By the Karush-Kuhn-Tucker (KKT) conditions, we need to prove
that $\hat{\beta}$ satisfies


-^57: (y^ - XsJZ) + np[,,,mj2) o 
+ V2,A7fel)°«gn(/3L) = o> 


/3: 


ll/3s„ 


(SI.36) 
(S1.37)

(S1.38)


- XfJlJWo. < M,a.( 0 +), 

- XZ^lJWo. < npU,(0+) + M,,^(0+). 

Note that 3^ satisfies —X'o^ (y"^ — X27 /3^ ) = 0. If 

min |/9™| > 62 X 2 and min||/3j||2 > 6'iAi, 
jes m ^ jes 


then we have (S1.36). Set C}^ <


Ai< 


min 1/3’?“ 1 
(j,m)eA ? 

2 \ 

/ nm 

/ |5^|* 

Note that 

min 11/3/II 2 

jes 

A 2 < 

min LS/* 
U,m)eA ^ 

201 

202 


Therefore, if min |/3j"| > 62 X 2 ^ then we must have min ||/ 3 j ||2 > ^lAi- 


{j,m)GA 


jes 


Pr < min |/5™| > 0iAi, min ||/ 3 j ||2 > 6 ' 2 A 2 




j&S 


|S„ 




M 

> 1 - 2 ^ exp 

m=l 


Tim 

*m \2 


n \2 






Similar to the proof of Theorem 4, (S1.37) holds when

V 2 ,A 2(0 + ) 
(l + V'm) ' 

V 2 ,A 2 ( 0 +) 


Then we have 


Pr<!||X-^lT^e-||^> 


M 


(1 

^V|a,( 0 +) 


, 3 m e { 1 , • • • , M} 


- £ “Pi 


(S1.39) 


(S1.40) 


(S1.41) 


Similarly, we can show that (S1.38) holds when ||X™7llAne”^||^ < • The probability
bound is derived as


Pr<! 11X^7lT^e"^|L> 


Vl,Ai (0 + ) + V 2 ,A 2(0 + ) 

(1 + 77 ) 


, 3 m e {1, • • • , M] 
M


< 


2(p-|S|)^exp 


m=l 


^^[Pl,Ai(0+) +P2,A2(0+)]^ 

2nmA„,a^(l +-0^)2 


(S1.42) 


Therefore, the theorem is proved by combining (S1.36), (S1.37), (S1.38), (S1.39), (S1.41),
and (S1.42). □
S2 Additional Numerical Results

Table S2.1: Analysis of lung cancer data using SGMCP: identified genes and their estimates.

Probe         Gene       UM        HLM       DFCI      MSKCC
201462_at     SCRN1                0.0034              0.0020
202831_at     GPX2                 -0.0022             -0.0021
203917_at     CXADR      0.0021              0.0004    0.0066
205776_at     FMO5       0.0005    0.0035    0.0038
206754_s_at   CYP2B6     0.0012              0.0020
207850_at     CXCL3                -0.0216             0.0120
208025_s_at   HMGA2      -0.0028   0.0001    -0.0037   -0.0012
219654_at     PTPLA      -0.0025   -0.0145             0.0055
219764_at     FZD10      -0.0005   -0.0019   -1.6E-05  -0.0022





Table S2.2: Analysis of lung cancer data using meta-analysis: identified genes and their
estimates.

Probe         Gene       UM        HLM       DFCI      MSKCC
201462_at     SCRN1                0.0101
203559_s_at   ABP1                 0.0005
203876_s_at   MMP11                                    -0.0066
203921_at     CHST2      0.0051
204855_at     SERPINB5   -0.0012
206754_s_at   CYP2B6                         0.0104
206994_at     CST4                 -0.0037
207850_at     CXCL3                -0.0246
208025_s_at   HMGA2      -0.0021             -0.0010
209343_at     EFHD1                                    0.0096
212328_at     LIMCH1                         0.0050
213703_at     LINC00342  0.0008
215867_x_at   CA12                 -0.0026
218677_at     S100A14                                  -0.0257
218824_at     PNMAL1     0.0003
219654_at     PTPLA                -0.0240
219747_at     NDNF       0.0002
220952_s_at   PLEKHA5              -0.0047
221841_s_at   KLF4       -0.0047
222043_at     CLU                  0.0049
Table S2.3: Analysis of lung cancer data using pooled analysis: identified genes and their
estimates.

Probe         Gene       UM        HLM       DFCI      MSKCC
201462_at     SCRN1                0.0101
203559_s_at   ABP1                 0.0005
203876_s_at   MMP11                                    -0.0066
203921_at     CHST2      0.0051
204855_at     SERPINB5   -0.0012
206754_s_at   CYP2B6                         0.0104
206994_at     CST4                 -0.0037
207850_at     CXCL3                -0.0246
208025_s_at   HMGA2      -0.0021             -0.0010
209343_at     EFHD1                                    0.0096
212328_at     LIMCH1                         0.0050
213703_at     LINC00342  0.0008
215867_x_at   CA12                 -0.0026
218677_at     S100A14                                  -0.0257
218824_at     PNMAL1     0.0003
219654_at     PTPLA                -0.0240
219747_at     NDNF       0.0002
220952_s_at   PLEKHA5              -0.0047
221841_s_at   KLF4       -0.0047
222043_at     CLU                  0.0049




Table S2.4: Analysis of lung cancer data using GMCP: identified genes and their estimates.

Probe         Gene       UM        HLM       DFCI      MSKCC
202503_s_at   KIAA0101   -0.0009   -0.0020   -0.0021   -0.0019
205776_at     FMO5       0.0001    0.0002    0.0002    -0.0001
207850_at     CXCL3      -0.0017   -0.0139   0.0029    0.0095
208025_s_at   HMGA2      -3.2E-05  1.1E-05   -3.8E-05  -2.2E-05
219654_at     PTPLA      -0.0036   -0.0092   -0.0024   0.0060
219764_at     FZD10      -0.0014   -0.0036   -0.0014   -0.0036